A Software-Based Hardware Fault Tolerance Scheme for Multicomputers
Authors
Abstract
A hardware fault tolerance scheme for large multicomputers executing time-consuming non-interactive applications is described. Error detection and recovery are done mostly by software with little hardware support. The scheme is based on simultaneous execution of identical copies of the application on two subnetworks of the system. Normal system operation is periodically suspended and the logical states of the two subnetworks are synchronized. Errors are detected by comparing the "frozen" synchronized states of the two subnetworks while they are being saved as "checkpoints" for possible subsequent use for error recovery. Algorithms for error detection and recovery using this scheme are discussed.
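The duplicate-and-compare cycle described in the abstract can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the paper's algorithm: the two subnetworks are modeled as two dictionaries, the workload `step`, the SHA-256 state digest, the function name `run_duplicated`, and the policy of rolling both replicas back to the last agreed checkpoint are all hypothetical choices made for illustration.

```python
import copy
import hashlib
import pickle


def step(state):
    """One deterministic unit of application work (hypothetical workload)."""
    state["i"] += 1
    state["acc"] = (state["acc"] * 31 + state["i"]) % 1_000_003


def digest(state):
    """Digest of a frozen state, used to compare the two replicas (assumed method)."""
    return hashlib.sha256(pickle.dumps(sorted(state.items()))).hexdigest()


def run_duplicated(total_steps, interval, fault_at=None):
    """Run two replicas in lockstep; compare and checkpoint every `interval` steps.

    fault_at: optional step at which a one-time transient fault corrupts replica B.
    Returns (final_state, number_of_rollbacks).
    """
    a = {"i": 0, "acc": 0}          # replica on "subnetwork A"
    b = copy.deepcopy(a)            # identical copy on "subnetwork B"
    checkpoint = copy.deepcopy(a)   # last state on which both replicas agreed
    rollbacks = 0
    fault_pending = fault_at is not None

    while a["i"] < total_steps:
        # Normal operation until the next synchronization point.
        target = min(checkpoint["i"] + interval, total_steps)
        while a["i"] < target:
            step(a)
            step(b)
            if fault_pending and b["i"] == fault_at:
                b["acc"] ^= 1       # injected transient fault (illustration only)
                fault_pending = False

        # Operation is suspended: compare the frozen states.
        if digest(a) == digest(b):
            checkpoint = copy.deepcopy(a)   # states agree, save as checkpoint
        else:
            # Mismatch: an error occurred since the last checkpoint.
            # Assumed recovery policy: restart both replicas from it.
            a = copy.deepcopy(checkpoint)
            b = copy.deepcopy(checkpoint)
            rollbacks += 1
    return a, rollbacks


clean_state, clean_rollbacks = run_duplicated(20, 5)
faulty_state, faulty_rollbacks = run_duplicated(20, 5, fault_at=7)
```

In this sketch a comparison alone cannot tell which replica is faulty, which is why both are restored from the checkpoint. The run with the injected fault performs one rollback and still finishes in the same final state as the fault-free run.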
Similar resources
Fault-Tolerant Multicasting in Multistage Interconnection Networks
In this paper, we study fault-tolerant multicasting in multistage interconnection networks (MINs) for constructing large-scale multicomputers. In addition to point-to-point routing among processor nodes, efficient multicasting is critical to the performance of multicomputers. This paper presents a new approach to provide fault-tolerant multicasting, which employs the restricted header encoding...
Algorithm-Based Fault-Tolerant Strategies in Faulty Hypercube and Star
This dissertation addresses the design of algorithm-based fault-tolerant strategies in faulty hypercube and star graph multicomputers without hardware modification. Several new concepts and designs are presented here under the permanent and transient fault models. Under the permanent fault model, we propose a new fault-tolerant reconfiguration scheme in the faulty hypercube and star graph multico...
Fault Tolerance for Multiprocessor Systems Via Time Redundant Task Scheduling
Fault tolerance is often considered as a good additional feature for multiprocessor systems but nowadays it is becoming an essential attribute. Fault tolerance can be achieved by the use of dedicated customized hardware that may have the disadvantage of large cost. Another approach to fault tolerance is to exploit existing redundancy in multiprocessor systems via a task scheduling software stra...
Design and Analysis of Transient Fault Tolerance for Multi Core Architecture
This paper describes a software approach to fault tolerance for shared-memory multi-core systems using PLR. PLR takes a software-centric approach to transient fault tolerance that ensures correct software execution. The scheme operates at the user-space level and does not require changes to the original application. PLR creates a set of redundant processes per application process. In this scheme ...
High-Coverage Fault Tolerance in Real-Time Systems Based on Point-to-Point Communication
The distributed recovery block (DRB) scheme is a widely applicable approach for realizing both hardware and software fault tolerance in real-time distributed and parallel computer systems. One of the most important extensions of the DRB scheme which were outlined in recent years but not developed fully is the integration of the DRB scheme and a network surveillance (NS) scheme. We recently deve...